Hadoop: Writing MapReduce Word Count Code in Eclipse and Running It on Hadoop
I. Required jars
hadoop-2.4.1\share\hadoop\hdfs\hadoop-hdfs-2.4.1.jar
all jars under hadoop-2.4.1\share\hadoop\hdfs\lib\
hadoop-2.4.1\share\hadoop\common\hadoop-common-2.4.1.jar
all jars under hadoop-2.4.1\share\hadoop\common\lib\
all jars under hadoop-2.4.1\share\hadoop\mapreduce\ except hadoop-mapreduce-examples-2.4.1.jar
all jars under hadoop-2.4.1\share\hadoop\mapreduce\lib\
II. The code
Mapper class: splits each input line into tokens and emits a (word, 1) pair for every token.

package kgc.mapred;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused across map() calls to avoid allocating new objects for every record
    static IntWritable one = new IntWritable(1);
    static Text word = new Text("");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the line on whitespace and emit (word, 1) for each token
        StringTokenizer words = new StringTokenizer(value.toString());
        while (words.hasMoreTokens()) {
            word.set(words.nextToken());
            context.write(word, one);
        }
    }
}
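The original code also hints (in a commented-out line) at a String.split alternative to StringTokenizer. A working sketch of that variant, as a drop-in replacement for the while loop above, would need a whitespace regex and a guard against the empty leading token split can produce:

        for (String w : value.toString().split("\\s+")) {
            if (!w.isEmpty()) { // skip the empty token split() yields on leading whitespace
                word.set(w);
                context.write(word, one);
            }
        }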
Reducer class: sums up the 1s collected for each word to produce its final count.

package kgc.mapred;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all the 1s the mappers emitted for this word
        int count = 0;
        for (IntWritable num : values) {
            count = count + num.get();
        }
        context.write(key, new IntWritable(count));
    }
}
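Since the reduce logic here is plain integer addition (associative and commutative), the same reducer class can optionally double as a combiner to pre-aggregate counts on the map side and shrink shuffle traffic. A minimal sketch, assuming you add one line to the driver class shown next:

        // Optional: run the reducer map-side as a combiner
        job.setCombinerClass(WordCountReducer.class);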
Main (driver) class that configures and submits the job.

package kgc.mapred;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static void main(String[] args) throws Exception {
        // Hadoop configuration
        Configuration cfg = new Configuration();
        // Create the job
        Job job = Job.getInstance(cfg, "WordCountMR");
        // Set the jar that contains the Mapper and Reducer definitions
        job.setJar("wordcount-0.0.1.jar");
        // job.setJarByClass(WordCount.class);
        // Class handling the map tasks
        job.setMapperClass(WordCountMapper.class);
        // Class handling the reduce tasks
        job.setReducerClass(WordCountReducer.class);
        // Input format: TextInputFormat splits the input into lines (ending in \r or \n)
        job.setInputFormatClass(TextInputFormat.class);
        // Input path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Final output format
        job.setOutputFormatClass(TextOutputFormat.class);
        // Final output path (must not exist yet)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Output key/value types shared by map and reduce
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}
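With TextOutputFormat, each reducer writes one key<TAB>value line per word, sorted by key. For a hypothetical input file containing the single line "hello world hello hadoop hello", the output part file would look like:

hadoop	1
hello	3
world	1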
III. Uploading the jar and running it on Hadoop
1. If you have a Maven project, you can simply run Run As --> Maven install in Eclipse to build the jar. The remaining steps are the same.
2. For a plain Java project, go to File --> Export --> Runnable JAR file --> choose a save path --> select the first radio button --> generate the jar.
Then upload the jar file to the virtual machine with Xshell.
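Any file-transfer tool works here; for example, with plain scp (hypothetical user, host, and path):

scp wordcount-0.0.1.jar hadoop@192.168.56.101:/home/hadoop/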
3. Upload the file to be counted to the virtual machine the same way, then put it into HDFS with the following command:
hadoop fs -put /local/path/of/file /destination/path/in/HDFS
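For example, assuming the input file was uploaded to /home/hadoop/words.txt (hypothetical paths):

hadoop fs -mkdir -p /wordcount/input
hadoop fs -put /home/hadoop/words.txt /wordcount/input/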
4. Run the jar and count the file contents with the following command:
yarn jar <jar file> /path/of/file/to/count /output/path (make sure the output path does not already exist)
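For example, continuing with the hypothetical paths above. The main class name can be passed explicitly after the jar; it may be omitted if the jar's manifest already declares a Main-Class, as an Eclipse Runnable JAR export does:

yarn jar wordcount-0.0.1.jar kgc.mapred.WordCount /wordcount/input/words.txt /wordcount/output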
5. The console then shows the running job, reports its progress, and finally prints a completion summary.
Of course, the job is visible not only in the console but also in Hadoop's web UIs: open IP:8088 (the YARN ResourceManager UI) in a browser to follow job progress, or IP:50070 (the NameNode UI) to check the generated output files.
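Once the job finishes, the results can be read back from HDFS; a successful run leaves a _SUCCESS marker next to the part files (again using the hypothetical output path from above):

hadoop fs -ls /wordcount/output
hadoop fs -cat /wordcount/output/part-r-00000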